Department of Information Management, Peking University; Research Center for Digital Humanities, Peking University; Institute for Artificial Intelligence, Peking University
Abstract:LLM-based research agents have advanced rapidly in science and engineering, where research is organized around executable experiments, code, and quantitative signals. Humanities scholarship, however, requires a different mode of reasoning: interpretive, evidence-grounded argument over primary sources, where scholarly value depends on faithful quotation, verifiable provenance, and close reading. Existing research agents remain largely optimized for execution and retrieval, not evidence-grounded interpretive reasoning. To address this gap, we introduce SPIRE (Scholarly-Primitives-Inspired Research Engine), a multi-agent framework for evidence-grounded humanities scholarship. Drawing on Scholarly Primitives theory, SPIRE casts recurring humanities operations as cooperating agent roles (source discovery, evidence annotation, comparison, provenance checking, sampling, citation binding, and argumentative synthesis) over a multi-scale close-reading substrate of passages, intra-context graph communities, and cross-context semantic clusters. On a peer-reviewed-paper benchmark over classical Chinese and Greco-Roman Latin scholarship, SPIRE recovers cited primary-source evidence more reliably than Naive LLM, Text RAG, and GraphRAG, and receives higher blind-judge scores on answer accuracy, depth, coverage, and evidence quality. Ablations show that both the scholarly-operation agents and close-reading retrieval contribute to evidence-grounded essays. Code, data catalogues, and reproduction scripts are released at https://github.com/YatingPan/SPIRE.
Abstract:The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
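A minimal sketch of the perturbation described in the abstract above, assuming unit-norm identity embeddings and a fixed steering angle. The function name `steer`, the 0.3 rad angle, and the 512-dimensional embedding are illustrative assumptions rather than the authors' implementation, and the adaptive, learned perturbation strength is not modeled here.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer(identity, angle, rng):
    """Rotate a unit embedding by `angle` radians toward a random orthogonal direction."""
    e = normalize(identity)
    # Sample a Gaussian vector and remove its component along e (one Gram-Schmidt step).
    g = [rng.gauss(0.0, 1.0) for _ in e]
    dot = sum(gi * ei for gi, ei in zip(g, e))
    u = normalize([gi - dot * ei for gi, ei in zip(g, e)])
    # Move along the great circle from e toward u; the result stays on the unit sphere.
    return [math.cos(angle) * ei + math.sin(angle) * ui for ei, ui in zip(e, u)]


rng = random.Random(0)
e = normalize([rng.gauss(0.0, 1.0) for _ in range(512)])  # stand-in identity embedding
perturbed = steer(e, angle=0.3, rng=rng)

# Cosine similarity between the original and perturbed embeddings.
cosine = sum(a * b for a, b in zip(e, perturbed))
norm = math.sqrt(sum(x * x for x in perturbed))
```

Because the steering direction is orthogonal to the original embedding, the cosine similarity to the original depends only on the angle (it equals cos(angle) up to rounding), so the angle acts as a single knob for perturbation strength in this sketch.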
Abstract:Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.
Abstract:Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that employs (i) a Granger-supervised, causal-aware gated fusion module, which learns to open toward event text only when it provides incremental predictive value beyond historical prices, and (ii) a multi-granularity alignment mechanism that jointly aligns high-level event representations and fine-grained textual cues with future market trajectories. Built as a flexible, plug-and-play adapter on top of off-the-shelf large language models and time-series foundation models, GS-Fuse can be instantiated across diverse backbones and market settings. Extensive experiments on real-world financial datasets show that GS-Fuse consistently outperforms state-of-the-art time-series and multimodal baselines across multiple assets and forecasting horizons.
Abstract:Pinching-antenna systems (PASS), which dynamically activate and relocate pinching antennas (PAs) along a dielectric waveguide, offer unprecedented potential for integrated positioning and communication. This paper proposes, for the first time, multi-waveguide-based uplink positioning approaches for indoor environments and analyzes the resulting downlink communication performance. Two scenarios, multi-waveguide single-PA (MWSP) and multi-waveguide multi-PA (MWMP), are considered under the assumptions of line-of-sight channels and a single, stationary user. For the MWSP scenario, a received signal strength indication (RSSI)-based ranging method and an MWSP-based least-squares (LS) positioning algorithm are developed, and a comprehensive error analysis of the LS positioning algorithm is conducted. For the MWMP scenario, a closed-form expression of the superposed signal is derived; based on the signal power, an MWMP-based grid search algorithm is proposed and its estimation error is analyzed. Then, based on the user's positioning result, the PAs are relocated to provide downlink communication service, and the achievable data rates of the MWSP and MWMP scenarios are analyzed. Numerical results validate the analysis and show that: i) For the MWSP scenario, a smaller geometric dilution of precision (GDoP) leads to a lower average positioning error. Furthermore, even when the GDoP is large, the regions where the distances to the PAs are nearly equal achieve the best accuracy. ii) For the MWMP scenario, non-parallel waveguide deployment improves positioning accuracy, although errors increase with the number of PAs. iii) Noise has a serious double impact on the data rate. There is a trade-off between positioning accuracy and communication performance.
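The ranging and LS positioning steps named in the abstract above can be illustrated with a generic textbook sketch. It assumes a plain free-space path-loss model, a 2D geometry, and noise-free measurements; the function names, anchor coordinates, and wavelength are hypothetical, and the paper's actual signal model (including in-waveguide propagation) is not reproduced.

```python
import math


def rssi_to_range(p_tx, p_rx, wavelength):
    """Invert the assumed free-space model p_rx = p_tx * (wavelength / (4*pi*d))**2."""
    return wavelength / (4.0 * math.pi) * math.sqrt(p_tx / p_rx)


def ls_position(anchors, ranges):
    """Linearized least-squares position from >= 3 anchors and range estimates (2D)."""
    (x1, y1), d1 = anchors[0], ranges[0]
    rows, rhs = [], []
    # Subtracting the first range equation from the others removes the quadratic term.
    for (xi, yi), di in zip(anchors[1:], ranges[1:]):
        rows.append((2.0 * (xi - x1), 2.0 * (yi - y1)))
        rhs.append(d1**2 - di**2 + xi**2 + yi**2 - x1**2 - y1**2)
    # Solve the 2x2 normal equations (A^T A) p = A^T b in closed form.
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * b for a, b in rows)
    s22 = sum(b * b for _, b in rows)
    t1 = sum(a * r for (a, _), r in zip(rows, rhs))
    t2 = sum(b * r for (_, b), r in zip(rows, rhs))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Hypothetical PA positions (metres) and a ground-truth user position.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
true_pos = (1.0, 2.0)
wavelength, p_tx = 0.01, 1.0

# Simulate noise-free received powers, then recover ranges and the position.
received = [
    p_tx * (wavelength / (4.0 * math.pi * math.dist(a, true_pos))) ** 2
    for a in anchors
]
ranges = [rssi_to_range(p_tx, p_rx, wavelength) for p_rx in received]
estimate = ls_position(anchors, ranges)
```

With noisy RSSI values the same solver returns a perturbed estimate whose error depends on the anchor geometry, which is the effect a GDoP analysis quantifies.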
Abstract:Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within a block. Specifically, TIDE introduces an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4$\times$ and 1.5$\times$ throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
Abstract:Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.
Abstract:Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, a substantial $7.2\%$ relative gain over standard GRPO, and secures an exceptional $93.8\%$ success rate on ALFWorld.
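For reference, the Hölder (power) mean that the abstract above builds on can be sketched as follows. This shows only the standard definition applied to a made-up list of token probabilities; how the paper plugs the mean into the policy-gradient objective, and how it anneals $p$, is not shown, and the values of `token_probs` and the chosen exponents are illustrative assumptions.

```python
import math


def holder_mean(values, p):
    """Hölder (power) mean of positive values; p == 0 is the geometric-mean limit."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-token probabilities for one sampled sequence.
token_probs = [0.9, 0.8, 0.05, 0.7]

# Sweep p from harmonic (-1) through geometric (0) and arithmetic (1) to a larger exponent.
exponents = (-1.0, 0.0, 1.0, 4.0)
means = [holder_mean(token_probs, p) for p in exponents]
for p, m in zip(exponents, means):
    print(f"p={p:+.1f}  mean={m:.4f}")
```

The mean is non-decreasing in $p$: small $p$ is dominated by the lowest-probability token, while large $p$ is dominated by the highest-probability ones, which is why a single parameter can interpolate between familiar fixed aggregations.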
Abstract:Accurate channel estimation remains challenging in high-mobility wireless systems because Doppler shifts induce severe inter-carrier interference (ICI) in Orthogonal Frequency Division Multiplexing (OFDM). We propose an unsupervised online channel estimation framework based on Implicit Neural Representation (INR). Unlike discrete-grid estimators, the proposed method decouples channel representation from the OFDM sampling resolution by modeling the time-varying frequency-selective channel as a continuous function of time-frequency coordinates. A Sinusoidal Representation Network (SIREN) with Gaussian Fourier feature mapping captures fine-grained channel variations and high-frequency details without offline pre-training or labeled data. For each received slot, the network parameters are updated by per-slot online fitting that minimizes a physics-aware ICI loss, while a confidence-aware decision-directed loop balances reliable pilots and dynamically harvested pseudo-pilots. Simulations in realistic Vehicle-to-Everything (V2X) environments show that the proposed method achieves near-optimal link-level reliability, significantly outperforming Least Squares (LS) and robust Linear Minimum Mean Square Error (LMMSE) estimators. Compared with supervised deep learning baselines, it also exhibits strong out-of-distribution (OOD) robustness under environmental distribution shifts, establishing an adaptable data-efficient physical-layer paradigm.
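A minimal forward-pass sketch of the representation described in the abstract above: a Gaussian Fourier feature mapping followed by a sine-activated layer, evaluated at a continuous time-frequency coordinate. The layer sizes, the scale `SIGMA`, the frequency factor `OMEGA0`, the omission of biases, and the weight initialization are all assumptions for illustration; the network is untrained, and the physics-aware ICI loss and per-slot online fitting are not modeled.

```python
import math
import random

rng = random.Random(0)


def fourier_features(coord, B):
    """Map a coordinate v to [cos(2*pi*B v), sin(2*pi*B v)] with a fixed random matrix B."""
    phases = [2.0 * math.pi * sum(b * c for b, c in zip(row, coord)) for row in B]
    return [math.cos(p) for p in phases] + [math.sin(p) for p in phases]


def uniform_matrix(rows, cols, bound):
    """Weight matrix with entries drawn uniformly from [-bound, bound]."""
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def sine_layer(x, W, omega0):
    """Sine-activated layer: sin(omega0 * W x), biases omitted for brevity."""
    return [math.sin(omega0 * sum(w * xi for w, xi in zip(row, x))) for row in W]


def linear(x, W):
    """Plain linear output layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical hyperparameters.
NUM_FREQS, HIDDEN, SIGMA, OMEGA0 = 16, 32, 4.0, 30.0

# B is sampled once from a Gaussian and then kept fixed.
B = [[rng.gauss(0.0, SIGMA) for _ in range(2)] for _ in range(NUM_FREQS)]
W1 = uniform_matrix(HIDDEN, 2 * NUM_FREQS, math.sqrt(6.0 / (2 * NUM_FREQS)) / OMEGA0)
W2 = uniform_matrix(2, HIDDEN, math.sqrt(6.0 / HIDDEN) / OMEGA0)


def channel(t, f):
    """Evaluate the (untrained) continuous channel model at normalized time t, frequency f."""
    feats = fourier_features((t, f), B)
    hidden = sine_layer(feats, W1, OMEGA0)
    re, im = linear(hidden, W2)
    return complex(re, im)


h = channel(0.25, 0.5)  # queryable at any coordinate, not only on the OFDM grid
```

Because `channel` takes real-valued coordinates, it can be queried between subcarriers or symbols, which is the sense in which such a representation is decoupled from the sampling grid.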
Abstract:Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi-agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis-specialization. We present SkillMAS, a non-parametric framework for adaptive specialization in multi-agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence-gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command-line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post-deployment specialization is attributed, updated, and applied.